Non-emoji numerals are detected as emoji #3

Open
kainosnoema wants to merge 1 commit into toddkramer:master from cotap:handle-composed-sequences

Conversation

@kainosnoema
Contributor

@kainosnoema kainosnoema commented Jul 22, 2016

Non-emoji numerals are treated as emoji. e.g. this fails:

XCTAssertFalse("1234567890".containsEmoji())

This is because `String.unicodeScalars` splits emoji into their codepoints, which for some characters yields standard ASCII. As an example, here are the codepoints for the "0 in a box" emoji:

```
- [1065] : "0"
- [1066] : "\u{FE0F}"
- [1067] : "\u{20E3}"
```

It's not as easy as removing ASCII characters from the list of unicode scalars, since that would break the implementation of containsEmojiOnly(). One solution would be to find a way to split strings into their composed character sequences, but then you'd have to also combine all possible modifier permutations. Still thinking of the proper way to solve this.
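The decomposition above can be reproduced directly. A minimal sketch (keycap-zero written out with its explicit scalars, so the example is copy-pasteable):

```swift
import Foundation

// Sketch reproducing the issue described above: the keycap-zero emoji
// is the ASCII digit "0" followed by a variation selector (U+FE0F)
// and the combining enclosing keycap (U+20E3).
let keycapZero = "0\u{FE0F}\u{20E3}" // "0️⃣"

let scalars = keycapZero.unicodeScalars.map { String(format: "U+%04X", $0.value) }
print(scalars) // ["U+0030", "U+FE0F", "U+20E3"]
// The first scalar is plain ASCII "0", so any scalar-based emoji set
// built from this sequence would also match ordinary digit strings.
```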

@kainosnoema
Contributor Author

Progress: it seems the only way to properly detect a sequence of codepoints is with enumerateSubstringsInRange(startIndex..<endIndex, options: .ByComposedCharacterSequences). Using this method, we can break both the emoji set and the input string into composed character sequences, then compare them directly.

The one hitch to this solution is the one I mentioned about modifiers, but that can be handled by checking if the sequence is made up of two codepoints, the first one being an emoji and the second one being a modifier.

I'm working on a pull request now with this approach, adding tests as I go.
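A minimal sketch of that approach, using the current Swift spelling of the API mentioned above (`enumerateSubstrings(in:options:)`); the `emojiSequences` set here is a hypothetical stand-in for the precomputed set of composed emoji sequences:

```swift
import Foundation

// Hypothetical precomputed set of composed emoji character sequences;
// the real implementation would enumerate all known emoji the same way.
let emojiSequences: Set<String> = ["0\u{FE0F}\u{20E3}", "\u{1F600}"]

func containsEmoji(_ string: String) -> Bool {
    var found = false
    // Walk the input one composed character sequence at a time, so a
    // keycap digit is seen as a single unit rather than three scalars.
    string.enumerateSubstrings(in: string.startIndex..<string.endIndex,
                               options: .byComposedCharacterSequences) { substring, _, _, stop in
        if let sequence = substring, emojiSequences.contains(sequence) {
            found = true
            stop = true
        }
    }
    return found
}

print(containsEmoji("1234567890"))        // false: bare digits never match
print(containsEmoji("0\u{FE0F}\u{20E3}")) // true: the full keycap sequence matches
```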

Due to the use of `unicodeScalars` previously, some ASCII characters
were being identified as emoji. In particular, the "Keycap Digit N"
characters are composed of the ASCII character followed by two other
codepoints. Keycap Digit Zero, for example, contains these scalars:

```
- [1065] : "0"
- [1066] : "\u{FE0F}"
- [1067] : "\u{20E3}"
```

In order to properly handle these sequences without false positives, we
have to split emojis into their composed character sequences and store
those as a set instead. The one complication here is that there are many
permutations of emoji with the skin tone modifiers. Instead of storing
each of these, we simply check if a character sequence has two
codepoints, and if so, that the first character is an emoji and the
second is a skin tone modifier. This is a fairly simple and efficient
way to accurately identify the presence of valid emoji.

Signed-off-by: Evan Owen <kainosnoema@gmail.com>
@kainosnoema
Contributor Author

kainosnoema commented Jul 22, 2016

Alright, here's a stab at fixing things. It requires a dramatically different approach to emoji detection, but it seems to be the most straightforward way to accurately detect emoji without false positives on ASCII digits. Performance is good too after the first enumeration of all emoji sequences.

Because it's so different though, you may have some suggestions on how to improve.

Edit: The other major change here is that I've removed UnicodeScalar.isEmoji(), since that doesn't really make sense: many emoji are made up of multiple UnicodeScalars. It's a breaking API change, but maybe one that wasn't intended to be used?
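The skin-tone handling described earlier can be sketched as a small check. This is an illustrative sketch, not the PR's actual code: `emojiScalars` is a hypothetical set of single-scalar emoji, and the Fitzpatrick skin-tone modifiers occupy U+1F3FB through U+1F3FF:

```swift
import Foundation

// Sketch of the two-codepoint modifier check described above. Instead of
// storing every skin-tone permutation, treat a two-scalar sequence as
// emoji when the first scalar is a known emoji and the second is a
// Fitzpatrick skin-tone modifier.
let emojiScalars: Set<UInt32> = [0x1F44B] // hypothetical set; waving hand as an example

func isSkinToneModifier(_ scalar: UnicodeScalar) -> Bool {
    // Fitzpatrick modifiers: U+1F3FB ... U+1F3FF
    return scalar.value >= 0x1F3FB && scalar.value <= 0x1F3FF
}

func isEmojiWithModifier(_ sequence: String) -> Bool {
    let scalars = Array(sequence.unicodeScalars)
    return scalars.count == 2
        && emojiScalars.contains(scalars[0].value)
        && isSkinToneModifier(scalars[1])
}

print(isEmojiWithModifier("\u{1F44B}\u{1F3FD}")) // true: waving hand + medium skin tone
print(isEmojiWithModifier("\u{1F44B}"))          // false: no modifier present
```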
